Based on a sample of data from online media (blogs, news, Twitter), this report aims to highlight the most frequent words and word combinations in modern English. The goal is to build a tool that can efficiently predict the next word for a person typing text in real time.
Some of the challenges with natural language processing (NLP) are the following:
Some of the computational challenges with NLP:
The datasets leveraged for this exercise stem from three types of media, ranging across the spectrum from casual (Twitter) to formal (news) language. They can already be considered a sample, so no additional sampling is applied at the exploration phase, in an attempt to understand the capabilities of a household device.
The twitter object is 2.6143238 × 10^8 bytes, the news object 6.2574801 × 10^8 bytes, and the blog object 8.2816618 × 10^8 bytes.
All three sizes are significant and will need special handling to stay within the computational capabilities of a household system.
The next step is to reshape the data into a data frame with one word per row and its count of appearances in the data as its frequency.
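As an illustrative sketch of this step (in Python rather than the report's actual R code; all names here are hypothetical), the word-frequency table can be built by tokenizing on whitespace and counting occurrences:

```python
from collections import Counter

def word_frequencies(text):
    """Split text on whitespace and count how often each word appears."""
    return Counter(text.split())

freqs = word_frequencies("the cat and the hat and the cat")
# freqs["the"] == 3, freqs["cat"] == 2, freqs["hat"] == 1
```

Each key of the resulting counter plays the role of `Var1` and each count the role of `Freq` in the table shown below.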
!! For the news and blog sources, the number of columns cannot be handled on an 8 GB RAM laptop. Hence I need to split the original tables in two for faster processing (i.e. when columns > 60).
In terms of clean-up, punctuation, capitalization and numerical characters are excluded to produce a more uniform dataset.
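A minimal Python sketch of this normalization step (assuming a simple regex-based rule; the report's actual R clean-up may differ):

```python
import re

def clean(text):
    """Lower-case, drop punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation and numbers
    return re.sub(r"\s+", " ", text).strip()

clean("Hello, World! 123 times.")  # -> "hello world times"
```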
After the clean-up, a sneak peek into the data:
## Source: local data frame [6 x 2]
##
## Var1 Freq
## (chr) (int)
## 1 the 1748107
## 2 and 1025288
## 3 to 1004097
## 4 a 844858
## 5 of 823929
## 6 i 724972
In my datasets, the 90th percentile corresponds to frequencies of 10, 10 or 15, depending on the table.
Only 1% of my words have a frequency higher than 325.28, 445 or 626 (depending on the data table).
The 10th and 50th percentiles correspond to a frequency of 1 in all tables, which means my data has a really long tail of words that appear only once and are therefore highly improbable.
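The long-tail observation can be reproduced on a toy frequency list with a simple nearest-rank percentile rule (illustrative values only, not the report's real counts):

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile of a sorted list (simple illustrative rule)."""
    k = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[k]

# Toy distribution: a long tail of singletons plus a few frequent words.
freqs = sorted([1] * 89 + [10] * 10 + [500])
percentile(freqs, 50)  # -> 1: the median word appears only once
percentile(freqs, 90)  # -> 10: only the top decile is at all frequent
```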
Some thoughts before modeling the data
My model could take into account the context of each person typing; for that, each source table (twitter, news, blogs) could be leveraged to produce different “context” variables.
As a first take, though, I am looking to come up with a context-less model for simplicity, even though performance might be compromised by design. Therefore, I will merge the three tables. However, given the different table sizes, I want to avoid skewing the frequencies towards the larger tables. To do so, I need to transform net frequencies into probabilities by dividing each word's count by the total number of word occurrences in its table.
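A hedged sketch of this normalization in Python (the counts and table names here are made up): each table's frequencies are divided by that table's total, so every source sums to 1 before merging and a larger source no longer dominates.

```python
def to_probabilities(freq_table):
    """Turn raw word counts into per-table probabilities."""
    total = sum(freq_table.values())
    return {word: count / total for word, count in freq_table.items()}

twitter = {"happy": 30, "the": 70}   # hypothetical counts
p_twitter = to_probabilities(twitter)
# p_twitter == {"happy": 0.3, "the": 0.7}; the table now sums to 1
```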
The above visualisation makes it easy to spot which words are rather specific to a medium (e.g. the word “happy” on Twitter) and which words have similar probabilities across media (e.g. the word “home”).
The next step before modeling the data is to start analyzing word combinations, i.e. n-grams. This should further boost the performance of a potential model, as the probability of a word combination (of 2, 3 or more words) could be a better predictor than the probability of a single word.
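N-gram extraction is a sliding window over the token list; a small Python sketch (function name hypothetical):

```python
from collections import Counter

def ngrams(words, n):
    """All consecutive n-word combinations in a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the cat sat on the cat".split()
bigram_counts = Counter(ngrams(tokens, 2))
# bigram_counts[("the", "cat")] == 2
```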
From this first phase, a key learning is that the top 1% of most frequent words are “staple” words (articles, connectors) that provide no context. An idea to explore is categorizing n-grams as contextual or context-less, so as to develop a model that is aware of its potential to predict (contextual) or not (context-less). For example:
1. When the previous word is context-less, e.g. “the”, the capability to predict is potentially low (the probability of a specific noun relative to that of all nouns).
2. When the previous word is contextual, e.g. “mayor”, the capability to predict is high and equals the probability of any member of the top 1% relative to all members of the top 1% (the largest group of all, with an exponential difference!).
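This contextual/context-less intuition can be sketched by estimating P(next word | previous word) from bigram counts. In the toy Python example below (hypothetical tokens, not the report's data), a context-less word like “the” spreads its probability mass across several followers, while a contextual word like “mayor” concentrates it:

```python
from collections import Counter, defaultdict

def next_word_probs(tokens):
    """Estimate P(next | prev) from bigram counts."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return {prev: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for prev, cnt in following.items()}

probs = next_word_probs("the mayor said the dog saw the mayor".split())
# probs["the"] splits mass between "mayor" and "dog" (low predictability);
# probs["mayor"] puts all its mass on "said" (high predictability).
```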
The code for this file can be found on my GitHub: https://github.com/pi-georgia/Capstone-_-Swiftkey.git